Log browsing moves Data organization Examples 


Organizing and analyzing logdata with entropy 


Sergey Bratus, Ph.D. 


Institute for Security Technology Studies 
Dartmouth College 



Outline 



1. Log browsing moves
   - Pipes and tables
   - Trees are better than pipes and tables!
2. Data organization
   - Trying to define the browsing problem
   - Entropy
   - Measuring co-dependence
   - Mutual Information
   - The tree building algorithm
3. Examples


What is this about? 


Why? 


- To design a better interface for browsing logs & packets
- A smarter interface that reacts to statistical properties of
  the data:
    - Show “anomalies” first
    - Show off correlations and where they break


How?


Design the browsing interface around
- Trees: natural for decision & classification
- Basic statistics for frequency distribution and correlation
    - Entropy, conditional entropy, mutual information, ...


How it started


- My wife ran a Tor node (kudos to Roger)
- Kept getting frantic messages from admins:

  Your machine is compromised! There is IRC traffic! (IRC=evil)

- OK, but how would we really check if there is something
  besides the “normal” Tor mix?
- Ethereal isn’t much help: how many page-long filters can
  you juggle?
- Wanted a tool that made classification simple.


Disclaimer


- These are really simple tricks.
- Not a survey of research literature (but see last slides).
    - You can do much cooler stuff with entropy & friends.
- These tricks are for off-line browsing (“analysis”), not
  IDS/IPS magic.
    - But they might help you understand that magic.


The UNIX pipe length contest



/var/log/secure:

Jan 13 21:11:11 zion sshd[3213] Accepted password for root from 209.61.200.11
Jan 13 21:30:20 zion sshd[3263] Failed password for neo from 68.38.148.149
Jan 13 21:34:12 zion sshd[3267] Accepted password for neo from 68.38.148.149
Jan 13 21:36:04 zion sshd[3355] Accepted publickey for neo from 129.10.75.101
Jan 14 00:05:52 zion sshd[3600] Failed password for neo from 68.38.148.149
Jan 14 00:05:57 zion sshd[3600] Accepted password for neo from 68.38.148.149
Jan 14 12:06:40 zion sshd[5160] Accepted password for neo from 68.38.148.149
Jan 14 12:39:57 zion sshd[5306] Illegal user asmith from 68.38.148.149
Jan 14 14:50:36 zion sshd[5710] Accepted publickey for neo from 68.38.148.149


And the question is:

  44 68.38.148.149
  12 129.10.75.101
   2 129.170.166.85
   1 66.183.80.107
   1 209.61.200.11
   ...

Successful logins via ssh using password, by IP address
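
The filter / group / count / sort pipeline behind such a table can be sketched in a few lines of Python. This is only an illustration, not the command used on the slide: the regular expression and the function name `successful_password_logins` are assumptions, and the sample lines are a subset of the /var/log/secure excerpt.

```python
import re
from collections import Counter

# A few lines taken from the /var/log/secure excerpt.
LOG_LINES = [
    "Jan 13 21:11:11 zion sshd[3213] Accepted password for root from 209.61.200.11",
    "Jan 13 21:30:20 zion sshd[3263] Failed password for neo from 68.38.148.149",
    "Jan 13 21:34:12 zion sshd[3267] Accepted password for neo from 68.38.148.149",
    "Jan 13 21:36:04 zion sshd[3355] Accepted publickey for neo from 129.10.75.101",
    "Jan 14 00:05:57 zion sshd[3600] Accepted password for neo from 68.38.148.149",
]

# Assumed message shape: "Accepted password for <user> from <ip>".
ACCEPTED_PASSWORD = re.compile(r"Accepted password for \S+ from (\S+)")


def successful_password_logins(lines):
    """Filter on 'Accepted password', group by IP, count, sort by count."""
    counts = Counter()
    for line in lines:
        match = ACCEPTED_PASSWORD.search(line)
        if match:
            counts[match.group(1)] += 1
    # most_common() returns (ip, count) pairs, largest count first.
    return counts.most_common()


if __name__ == "__main__":
    for ip, cnt in successful_password_logins(LOG_LINES):
        print(f"{cnt:4d} {ip}")
```

Every further question (another user, another auth method) needs another regular expression or another pipe stage, which is the point the slide title makes.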


where is my WHERE clause?


What is this?

SELECT COUNT(*) AS cnt, ip FROM logdata
GROUP BY ip ORDER BY cnt DESC



var.log.secure

serial | date       | time     | host | daemon     | message                                        | pid  | ip            | user
10     | 2006-01-13 | 21:11:11 | zion | sshd[3213] | Accepted password for root from 209.61.200.11  | 3213 | 209.61.200.11 | root
11     | 2006-01-13 | 21:30:20 | zion | sshd[3263] | Failed password for neo from 68.38.148.149     | 3263 | 68.38.148.149 | neo
12     | 2006-01-13 | 21:34:12 | zion | sshd[3267] | Accepted password for neo from 68.38.148.149   | 3267 | 68.38.148.149 | neo
13     | 2006-01-13 | 21:36:04 | zion | sshd[3355] | Accepted publickey for neo from 129.10.75.101  | 3355 | 129.10.75.101 | neo
14     | 2006-01-14 | 00:05:52 | zion | sshd[3600] | Failed password for neo from 68.38.148.149     | 3600 | 68.38.148.149 | neo
15     | 2006-01-14 | 00:05:57 | zion | sshd[3600] | Accepted password for neo from 68.38.148.149   | 3600 | 68.38.148.149 | neo
16     | 2006-01-14 | 12:06:40 | zion | sshd[5160] | Accepted password for neo from 68.38.148.149   | 5160 | 68.38.148.149 | neo
17     | 2006-01-14 | 12:39:57 | zion | sshd[5306] | Illegal user asmith from 68.38.148.149         | 5306 | 68.38.148.149 | asmith
18     | 2006-01-14 | 14:50:36 | zion | sshd[5710] | Accepted publickey for neo from 68.38.148.149  | 5710 | 68.38.148.149 | neo


cnt | ip
 44 | 68.38.148.149
 12 | 129.10.75.101
  2 | 129.170.166.85
  1 | 66.183.80.107
  1 | 209.61.200.11

(Successful logins via ssh using password, by IP address)


Must... parse... syslog...


Wanted:


Free-text syslog records -> named fields


Reality check


- printf format strings are at developers’ discretion
- 120+ types of remote connections & user auth in Fedora Core


Pattern language


sshd:

    Accepted %auth for %user from %host
    Failed %auth for %user from %host
    Failed %auth for illegal %user from %host

ftpd:

    %host: %user[%pid]: FTP LOGIN FROM %host [%ip], %user
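
One way such `%field` patterns could be turned into parsers is to compile each one into a regular expression with named groups. The sketch below is an assumption about how this might work, not the tool's actual implementation: `compile_pattern` and `parse_message` are hypothetical names, every field is assumed to be a run of non-space characters, and a field that repeats within a pattern gets a numeric suffix because Python group names must be unique.

```python
import re


def compile_pattern(pattern):
    """Compile e.g. 'Accepted %auth for %user from %host' into a regex."""
    # re.split with a capture group alternates literal text and field names.
    pieces = re.split(r"%(\w+)", pattern)
    regex = ""
    seen = {}
    for index, piece in enumerate(pieces):
        if index % 2 == 0:
            regex += re.escape(piece)  # literal text between fields
        else:
            seen[piece] = seen.get(piece, 0) + 1
            # Second and later uses of a field become host_2, user_2, ...
            name = piece if seen[piece] == 1 else piece + "_" + str(seen[piece])
            regex += "(?P<" + name + r">\S+)"
    return re.compile(regex)


SSHD_PATTERNS = [
    "Accepted %auth for %user from %host",
    "Failed %auth for %user from %host",
    "Failed %auth for illegal %user from %host",
]
COMPILED = [compile_pattern(p) for p in SSHD_PATTERNS]


def parse_message(message):
    """Return the named fields of the first pattern matching the whole message."""
    for rx in COMPILED:
        match = rx.fullmatch(message)
        if match:
            return match.groupdict()
    return None  # no pattern covers this message


if __name__ == "__main__":
    print(parse_message("Accepted password for root from 209.61.200.11"))
    print(parse_message("Illegal user asmith from 68.38.148.149"))
```

The second call returns `None`: the "Illegal user" message from the log excerpt is not covered by the three sshd patterns listed, which illustrates the reality check about developer-chosen format strings.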
“The great cycle”


1. Filter
2. Group
3. Count
4. Sort
5. Rinse, repeat


grep user1 /var/log/messages | grep ip1 | grep ...
awk -f script ... | sort | uniq -c | sort -n


SELECT * FROM logtbl WHERE user = 'user1' AND ip = 'ip1'
GROUP BY ... ORDER BY ...


Can we do better than pipes & tables?


Humans naturally think in classification trees:
- Protocol hierarchies (e.g., Wireshark)
- Firewall decision trees (e.g., iptables chains)


Ethernet                                      100.00%
  Logical-Link Control                         70.83%
    Spanning Tree Protocol                     50.00%
    Datagram Delivery Protocol                 18.75%
      Routing Table Maintenance Protocol       18.75%
    Cisco Discovery Protocol                    2.08%
  Internet Protocol                            12.50%
    Protocol Independent Multicast              4.17%
    Internet Group Management Protocol          2.08%
    User Datagram Protocol                      6.25%
      Routing Information Protocol              4.17%
      Bootstrap Protocol                        2.08%
  Address Resolution Protocol                  16.67%

[Figure: the same hierarchy drawn as a tree rooted at Ethernet, with leaves such as RTMP, RIP, and BOOTP.]


Use tree views to show logs! 


Pipes, SQL queries -> branches / paths


Groups <-> nodes (sorted by count / weight), records <-> leaves.
Queries pick out a leaf or a node in the tree.


cnt | ip
 44 | 68.38.148.149
 12 | 129.10.75.101
  2 | 129.170.166.85
  1 | 66.183.80.107
  1 | 209.61.200.11


grep 68.38.148.149 /var/log/secure | grep asmith | grep ...
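
A minimal sketch of the correspondence between grouping and tree nodes follows. It is an illustration under assumptions, not the tool's code: `build_tree` and `select` are hypothetical names, and the records are the `ip` and `user` columns of the nine parsed rows from var.log.secure.

```python
def build_tree(records, fields):
    """Nest records by successive fields; children are sorted by group size."""
    if not fields:
        return {"count": len(records), "records": records}  # leaf level
    field, rest = fields[0], fields[1:]
    groups = {}
    for rec in records:
        groups.setdefault(rec[field], []).append(rec)
    # Largest groups first, as in the cnt | ip table.
    ordered = sorted(groups.items(), key=lambda kv: -len(kv[1]))
    return {
        "count": len(records),
        "field": field,
        "children": {value: build_tree(group, rest) for value, group in ordered},
    }


def select(tree, path):
    """Follow a path of field values; the analogue of grep | grep | ..."""
    node = tree
    for value in path:
        node = node["children"][value]
    return node


# ip and user columns of the nine parsed rows.
RECORDS = [
    {"ip": "209.61.200.11", "user": "root"},
    {"ip": "68.38.148.149", "user": "neo"},
    {"ip": "68.38.148.149", "user": "neo"},
    {"ip": "129.10.75.101", "user": "neo"},
    {"ip": "68.38.148.149", "user": "neo"},
    {"ip": "68.38.148.149", "user": "neo"},
    {"ip": "68.38.148.149", "user": "neo"},
    {"ip": "68.38.148.149", "user": "asmith"},
    {"ip": "68.38.148.149", "user": "neo"},
]

if __name__ == "__main__":
    tree = build_tree(RECORDS, ["ip", "user"])
    for ip, node in tree["children"].items():
        print(node["count"], ip)
    print(select(tree, ["68.38.148.149", "asmith"])["count"])
```

Here the path `["68.38.148.149", "asmith"]` plays the role of the two chained `grep` commands: each path element narrows the selection to one child node.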
A “coin sorter” for records/packets


Classify -> Save -> Apply



- Build a classification tree from a dataset
- Save template
- Reuse on another dataset


Which tree to choose?



Goal: best grouping 


How to choose the “best” grouping (tree shape) for a dataset? 

Trying to define the browsing problem



The lines you need are only 
20 PgDns away: 

...each one surrounded by 
a page of chaff... 

...in a twisty maze of 
messages, all alike... 

...but slightly different, in 
ways you don’t expect. 

Old tricks


Sorting, grouping & filtering:
- Shows max and min values in a field
- Groups together records with the same values
- Drills down to an “interesting” group


Key problems:


1. Where to start? Which column or protocol feature to pick?
2. How to group? Which grouping helps best to understand
   the overall data?
3. How to automate guessing (1) and (2)?

Estimating uncertainty


- Most lines in a large log will not be examined directly, ever.
- One just needs to convince oneself that one has seen
  everything interesting.
- “Jump straight to the interesting stuff”, compress the rest.


Example


A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A B A A A A A A A ...


The problem: 

Entropy intuitions


The number of bits to encode a data item under optimal
encoding (asymptotically, in a very long stream)


Uniform distribution:

ABBBACDBAADBBDCAAACBCDACBBADAD
CBBBAABADA... => 2 bits/symbol


Non-uniform:

BAADBAAAAAABAAACABBABAAAAAAAAA
BAAAADBAAC... => 1.42 bits/symbol


Entropy of English: 0.6 to 1.6 bits per char (best compression).


The entropy of English?


Depending on the model, 0.6 to 1.6 bits per character.


Letters
  unigrams: XFOML RXKHRJFFJUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD
            ZEWRTZYNSADXESYJRQY WGECIJJ
  bigrams:  OCRO HLI RGWR NMIELWIS EU LL NBNESEBYATH EEI ALHENHTTPA OOBTTVA NAH BRL OR L RW
            NILI E NNSBATEI AI NGAE ITF NNR ASAEV OIE BAINTHA HYROO POER SETRYGAIETRWCO
  trigrams: ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONSIVE TUCOOWE AT
            TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE

Words
  unigrams: REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE
            THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE HAD MESSAGES
  bigrams:  THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS
            POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHOEVER
  trigrams: THE BEST FILM ON TELEVISION TONIGHT IS THERE NO-ONE HERE WHO HAD A LITTLE BIT OF
            FLUFF


Shannon’s experiment


Based on how likely humans are to be wrong when predicting
the next letter / word (the average number of guesses made to
guess the next letter / word correctly)
http://math.ucsd.edu/~crypto/java/ENTROPY/


Automating old tricks (1)


“Look at the most frequent and least frequent values” in a
column or list.


- What if there are many columns and batches of data?
- Which column to start with? How to rank them?


It would be nice to begin with “easier to understand” columns or
features.


Suggestion:


1. Start with a data summary based on the columns with the
   simplest value frequency charts (histograms).
2. Simplicity -> less uncertainty -> smaller entropy.
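
The suggestion can be sketched as a ranking of columns by the entropy of their value histograms. This is a minimal illustration, assuming records are dictionaries with identical keys; `rank_columns` is a hypothetical helper name, and the five sample records are taken from the parsed var.log.secure rows.

```python
import math
from collections import Counter


def entropy(values):
    """Entropy (in bits) of the frequency histogram of the given values."""
    counts = Counter(values)
    total = sum(counts.values())
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def rank_columns(records):
    """Return (column, entropy) pairs, simplest histogram first."""
    columns = list(records[0].keys())
    scores = {col: entropy(rec[col] for rec in records) for col in columns}
    return sorted(scores.items(), key=lambda kv: kv[1])


# host, user and pid columns of five parsed rows.
RECORDS = [
    {"host": "zion", "user": "root", "pid": 3213},
    {"host": "zion", "user": "neo", "pid": 3263},
    {"host": "zion", "user": "neo", "pid": 3267},
    {"host": "zion", "user": "neo", "pid": 3355},
    {"host": "zion", "user": "asmith", "pid": 5306},
]

if __name__ == "__main__":
    for column, bits in rank_columns(RECORDS):
        print(f"{column:5s} {bits:.3f} bits")
```

A column with a single value (here `host`) has entropy 0 and tells nothing apart; a column where every value is unique (here `pid`) has the maximum entropy and is the hardest to summarize.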

Trivial observations, visualized


Start simple: Ranges


A frequency histogram


[Figure: histogram of packet counts per dst_port value, sorted by frequency, with 445 as the largest bin. It is shown next to a Wireshark packet list of TCP packets from source port 54973 to many destination ports (www, https, ftp, netbios-ssn, radmin-port, microsoft-ds, tcpmux, socks, ingreslock, 3182, 4000, 8000, 8100) on 129.170.166.103.]


Start simple: Histograms


Probability distribution


[Figure: histogram of counts n_1, n_2, n_3, ..., n_k per dst_port value, sorted by frequency. Each n_i is the count of packets in one bin, e.g. the “dst_port == 445” bin.]


Total count:

    N = n_1 + \dots + n_k


“Probability” of a packet falling into the i-th bin:

    p_i = \frac{n_i}{N}, \quad i = 1, \dots, k


Definition of entropy


Let a random variable X take values x_1, x_2, ..., x_k with
probabilities p_1, p_2, ..., p_k. The entropy of X is

    H(X) = -\sum_{i=1}^{k} p_i \log_2 p_i


- Entropy measures the uncertainty or lack of information
  about the values of a variable.
- Entropy is related to the number of bits needed to encode
  the missing information (to full certainty).


Why logarithms?


Fact: 



If some object is more likely to be picked than others, 
uncertainty decreases. 

Entropy on a histogram


Interpretation


Entropy is a measure of uncertainty about the value of X.

- X = (.25 .25 .25 .25): H(X) = 2 (bits)
- X = (.5 .3 .1 .1): H(X) = 1.685
- X = (.8 .1 .05 .05): H(X) = 1.022
- X = (1 0 0 0): H(X) = 0

For only one value, the entropy is 0.

When all N values have the same frequency,
the entropy is maximal, log_2 N.
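
The four values on this slide can be checked with a short calculation. This is a sketch assuming entropy in bits (base-2 logarithm) and the usual convention that a zero-probability bin contributes nothing; `entropy_bits` is a name chosen here for illustration.

```python
import math


def entropy_bits(probs):
    """Entropy in bits of a probability vector; zero bins contribute 0."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)


HISTOGRAMS = [
    (0.25, 0.25, 0.25, 0.25),
    (0.5, 0.3, 0.1, 0.1),
    (0.8, 0.1, 0.05, 0.05),
    (1.0, 0.0, 0.0, 0.0),
]

if __name__ == "__main__":
    for probs in HISTOGRAMS:
        print(probs, round(entropy_bits(probs), 3))
    # Maximum for N equally frequent values is log2(N).
    print(math.log2(4))
```

The more the histogram is dominated by a single bin, the closer the entropy gets to 0. The uniform case reaches the maximum, log_2 4 = 2 bits.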



Compare histograms


Start with the simplest


A tree grows in Ethereal


Automating old tricks (2)


“Look for correlations. If two fields are strongly correlated on
average, but for some values the correlation breaks, look at
those more closely”.


- Which pair of fields to start with?
- How to rank correlations?

Too many to try by hand, even with a good graphing tool like R
or Matlab.


Suggestion:


1. Try and rank pairs before looking, and look at the simpler
   correlations first.
2. Simplicity -> stronger correlation between features ->
   smaller conditional entropy.


Examples (1)


Example


Source IP of user logins:
- Almost everyone comes in from a couple of machines
- One user comes in from all over the place. Problem?


Example


Small network, SRC_IP ~ TTL
- On average, src_ip predicts ttl.
- What if a host sends packets with all sorts of ttl?
    - A user just discovered traceroute?
    - What if that machine is a printer or appliance?


Examples (2)


MUD: Multi-user text adventure (like WoW in ASCII text, only
better PvP)


Example


%user gets %obj [%objnum] in room %room

- 2 rooms had by far the largest number of objects picked up.
- Major source of money in the game was: robbers!
- Stationary camp, safe area, close to cities, easy kill...


Example


Cheating: player killing by agreement for experience
- A kills B repeatedly, often in the same room. Why?
- A gets experience, warpoints, levels. B is used as a
  throw-away character, owner of B gets favors.


Histograms 3d: Feature pairs




Joint Entropy


For fields X and Y, count the number of times n_{ij} that a pair (x_i, y_j) is seen
together in the same record.

        | y_1    | y_2    | ...
  x_1   | n_{11} | n_{12} | ...
  x_2   | n_{21} | n_{22} | ...
  ...   | ...    | ...    | ...

    p(x_i, y_j) = \frac{n_{ij}}{N}, \qquad N = \sum_{i,j} n_{ij}

    H(X, Y) = -\sum_{i,j} p(x_i, y_j) \log_2 p(x_i, y_j)


Measure of mutual dependence


- How much does knowing X tell us about Y (on average)?
- How strong is the connection?


Dependence


Independent variables X and Y:
- Knowing X tells us nothing about Y
- No matter what x we fix, the histogram of Y’s values
  co-occurring with that x will be the same shape
- H(X, Y) = H(X) + H(Y)


Dependent X and Y:
- Knowing X tells us something about Y (and vice versa)
- Histograms of y's co-occurring with a fixed x have different
  shapes
- H(X, Y) < H(X) + H(Y)

Mutual Information


Definition


Conditional entropy of Y given X:

    H(Y|X) = H(X, Y) - H(X)

Uncertainty about Y left once we know X.


Definition


Mutual information of two variables X and Y:

    I(X; Y) = H(X) + H(Y) - H(X, Y)

Reduction in uncertainty about X once we
know Y and vice versa.
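
Both definitions reduce to entropies of histograms, so they can be computed directly from two columns of values. The sketch below assumes the columns are equal-length lists taken from the same records. The function names are illustrative, and the src_ip / ttl values are made-up toy data echoing the SRC_IP ~ TTL example.

```python
import math
from collections import Counter


def entropy(values):
    """Entropy (in bits) of the frequency histogram of the given values."""
    counts = Counter(values)
    total = sum(counts.values())
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def joint_entropy(xs, ys):
    """H(X, Y): entropy of the histogram of co-occurring pairs."""
    return entropy(zip(xs, ys))


def conditional_entropy(ys, xs):
    """H(Y|X) = H(X, Y) - H(X)."""
    return joint_entropy(xs, ys) - entropy(xs)


def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(xs) + entropy(ys) - joint_entropy(xs, ys)


# Toy data (hypothetical): src_ip predicts ttl exactly in the first case,
# and tells nothing about it in the second.
SRC_IP = ["a", "a", "b", "b"]
TTL_DEPENDENT = [64, 64, 128, 128]
TTL_INDEPENDENT = [64, 128, 64, 128]

if __name__ == "__main__":
    print(conditional_entropy(TTL_DEPENDENT, SRC_IP))    # nothing left to learn
    print(mutual_information(SRC_IP, TTL_DEPENDENT))
    print(conditional_entropy(TTL_INDEPENDENT, SRC_IP))  # full uncertainty remains
    print(mutual_information(SRC_IP, TTL_INDEPENDENT))
```

In the dependent case H(Y|X) is 0 and I(X; Y) equals H(Y). In the independent case H(X, Y) = H(X) + H(Y), so the mutual information is 0.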


Histograms 3d: Feature pairs, Port scan


Snort port scan alerts


[Screenshot: TreeView2, source: Kerf/data/snort2.log, status line “autosplit via minentdep3 without mark: -- OK”]

[1339] Snort portscan alerts
  [55/1135] dst_port: 445   src_ip: 55                 dst_ip: 75  src_port: 100+
  [8/70]    dst_port: 80    src_ip: 8                  dst_ip: 30  src_port: 63
  [1/26]    dst_port: 21    src_ip: (80.141.141.173)   dst_ip: 11  src_port: 11
  [1/22]    dst_port: 4899  src_ip: (218.103.195.242)  dst_ip: 22  src_port: 22
  [2/20]    dst_port: 4000  src_ip: 2                  dst_ip: 8   src_port: 15
  [1/15]    dst_port: 443   src_ip: (211.5.239.5)      dst_ip: 9   src_port: 9
  [1/15]    dst_port: 139   src_ip: (129.170.125.243)  dst_ip: 8   src_port: 8
  [1/12]    dst_port: 1524  src_ip: (192.139.15.34)    dst_ip: 12  src_port: (1524)
  [1/9]     dst_port: 1     src_ip: (209.15.84.72)     dst_ip: 9   src_port: 9
  [1/3]     dst_port: 8100  src_ip: (194.208.40.120)   dst_ip: 2   src_port: 2
  [1/3]     dst_port: 8000  src_ip: (194.208.40.120)   dst_ip: 2   src_port: 2
  [1/3]     dst_port: 8080  src_ip: (194.208.40.120)   dst_ip: 2   src_port: 2
  [1/3]     dst_port: 3128  src_ip: (194.208.40.120)   dst_ip: 2   src_port: 2
  [1/3]     dst_port: 1080  src_ip: (194.208.40.120)   dst_ip: 2   src_port: 2

Fields panel (values are cut off in the screenshot):

Field               | #   | Value
%dst_port %dst_ip   | 194 | ...
%dst_port %src_ip   | 76  | ...
%dst_port %src_port | 687 | ...
dst_ip              | 75  | 129.170...
dst_port            | 14  | 1, 21, 80, ...
flags               |     | ******S
loghost             |     | annon
program             |     | snort
repeat              |     |
rule id             |     | 732c5ec...
serial              |     | -1
src_ip              | 71  | 4.40.45...
src_port            | 668 | 1027, ...
type                |     | SYN

[Screenshot: the same tree with each group labeled by its record count only ([1135], [70], [26], [22], [20], [15], [15], [12], [9], [3], ...). The Fields panel additionally lists id (100+ values), month (Apr), timestamp (100+ values, Fri 11-Apr-...), _year (2003), and mark (pos).]


Building a data view


O Pick the feature with lowest non-zero 
entropy (“simplest histogram”) 


min H(X) 
>0 

I 

X? 


□ ► < g ► < 4 ? ► < f ► 1 -0 0.0 
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Building a data view 


9 


Pick the feature with lowest non-zero 
entropy (“simplest histogram”) 

Split all records on its distinct values 


dst_poil 


min H(YI 
dst_port ) 


Y? 
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Building a data view 


s 

O Pick the feature with lowest non-zero . 

entropy (“simplest histogram”) j 

9 Split all records on its distinct values d*_p° rt 

O Order other features by the strength [ 

of their dependence with with the 
first feature (conditional entropy or 
mutual information) 

z? 


min H(ZI 
src_ip) 


□ 


9 
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Building a data view 



O Pick the feature with lowest non-zero 

s 

1 


entropy (“simplest histogram”) 

\ 


O Split all records on its distinct values 

dst_port 

Q Order other features by the strength 

/ 


of their dependence with with the 


first feature (conditional entropy or 

src_ip 


mutual information) 

min H(ZI 


O Use this order to label groups 

\ dst_port) / 
Z? 



« □ ► 4 ► < 1 ► 4 

? ► i 
Building a data view

1. Pick the feature with the lowest non-zero entropy ("simplest histogram").
2. Split all records on its distinct values.
3. Order the other features by the strength of their dependence on the first feature (conditional entropy or mutual information).
4. Use this order to label groups.
5. Repeat with the next feature in (1).

[Diagram: a tree growing S -> dst_port -> src_ip -> dst_ip -> T?, where each next feature T is chosen by min H(T | .).]
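The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not the Kerf implementation: the function names (`entropy`, `cond_entropy`, `pick_feature`, `build_view`), the dictionary layout of the result, and the reading of step 5 as "recurse into each group with the remaining features" are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def cond_entropy(records, x, y):
    """H(X | Y): entropy of field x, averaged over groups sharing a value of y."""
    groups = defaultdict(list)
    for r in records:
        groups[r[y]].append(r[x])
    n = len(records)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def pick_feature(records, fields):
    """Step 1: the field with the lowest non-zero entropy, or None."""
    scored = [(entropy([r[f] for r in records]), f) for f in fields]
    scored = [(h, f) for h, f in scored if h > 0]
    return min(scored)[1] if scored else None

def build_view(records, fields):
    """Steps 2-5 (hypothetical layout): split, order the rest, recurse."""
    first = pick_feature(records, fields)
    if first is None:  # nothing left that varies: this is a leaf
        return {"count": len(records), "split_on": None, "children": {}}
    # Step 3: strongest dependence on `first` means lowest H(f | first).
    rest = sorted((f for f in fields if f != first),
                  key=lambda f: cond_entropy(records, f, first))
    groups = defaultdict(list)
    for r in records:  # step 2: split on distinct values
        groups[r[first]].append(r)
    return {
        "count": len(records),
        "split_on": first,
        "label_order": rest,  # step 4: order used to label groups
        "children": {v: build_view(g, rest) for v, g in groups.items()},
    }

# Toy records (made up for illustration).
records = [
    {"dst_port": 445, "src_ip": "a", "dst_ip": "x"},
    {"dst_port": 445, "src_ip": "b", "dst_ip": "y"},
    {"dst_port": 445, "src_ip": "c", "dst_ip": "z"},
    {"dst_port": 80, "src_ip": "d", "dst_ip": "x"},
]
tree = build_view(records, ["dst_port", "src_ip", "dst_ip"])
print(tree["split_on"], sorted(tree["children"]))
```

On this toy batch `dst_port` has the simplest histogram, so it becomes the first split; each group is then organized the same way using the remaining fields.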
Snort port scan alerts

[Screenshot: TreeView2, source: Kerf/data/snort2.log. Tabs: Fields, Slate, Features, Top, Index. Status line: "autosplit via minentdep3 without mark: -- OK".]

[1339] Snort portscan alerts
  > [55/1135] dst_port: 445   src_ip: 55   dst_ip: 75   src_port: 100+
  > [8/70]    dst_port: 80    src_ip: 8    dst_ip: 30   src_port: 63
  > [1/26]    dst_port: 21    src_ip: (80.141.141.173)   dst_ip: 11   src_port: 11
  > [1/22]    dst_port: 4899  src_ip: (218.103.195.242)  dst_ip: 22   src_port: 22
  > [2/20]    dst_port: 4000  src_ip: 2    dst_ip: 8    src_port: 15
  > [1/15]    dst_port: 443   src_ip: (211.5.239.5)      dst_ip: 9    src_port: 9
  > [1/15]    dst_port: 139   src_ip: (129.170.125.243)  dst_ip: 8    src_port: 8
  > [1/12]    dst_port: 1524  src_ip: (192.139.15.34)    dst_ip: 12   src_port: (1524)
  > [1/9]     dst_port: 1     src_ip: (209.15.84.72)     dst_ip: 9    src_port: 9
  > [1/3]     dst_port: 8100  src_ip: (194.208.40.120)   dst_ip: 2    src_port: 2
  > [1/3]     dst_port: 8000  src_ip: (194.208.40.120)   dst_ip: 2    src_port: 2
  > [1/3]     dst_port: 8080  src_ip: (194.208.40.120)   dst_ip: 2    src_port: 2
  > [1/3]     dst_port: 3128  src_ip: (194.208.40.120)   dst_ip: 2    src_port: 2
  > [1/3]     dst_port: 1080  src_ip: (194.208.40.120)   dst_ip: 2    src_port: 2

Fields panel (Field, #, Value; values are cut off in the screenshot):
  %dst port %dst ip     194   1 129.1...
  %dst port %src ip     76    1 209....
  %dst port %src port   687   1 1551,...
  dst ip                75    129.170...
  dst port              14    1, 21, 8...
  flags                       ******S
  loghost                     annon
  program                     snort
  repeat
  rule id                     732c5ec...
  serial                      -1
  src ip                71    4.40.45....
  src_port              668   1027, 1C...
  type                        SYN

Snort port scan alerts

[Screenshot: TreeView2, source: Kerf/data/snort2.log. Same tree, with a different selection shown in the Fields panel.]

[1339] Snort portscan alerts
  > [1135] dst_port: 445   src_ip: 55   dst_ip: 75   src_port: 100+
  > [70]   dst_port: 80    src_ip: 8    dst_ip: 30   src_port: 63
  > [26]   dst_port: 21    src_ip: (80.141.141.173)   dst_ip: 11   src_port: 11
  > [22]   dst_port: 4899  src_ip: (218.103.195.242)  dst_ip: 22   src_port: 22
  > [20]   dst_port: 4000  src_ip: 2    dst_ip: 8    src_port: 15
  > [15]   dst_port: 443   src_ip: (211.5.239.5)      dst_ip: 9    src_port: 9
  > [15]   dst_port: 139   src_ip: (129.170.125.243)  dst_ip: 8    src_port: 8
  > [12]   dst_port: 1524  src_ip: (192.139.15.34)    dst_ip: 12   src_port: (1524)
  > [9]    dst_port: 1     src_ip: (209.15.84.72)     dst_ip: 9    src_port: 9
  > [3]    dst_port: 8100  src_ip: (194.208.40.120)   dst_ip: 2    src_port: 2
  > [3]    dst_port: 8000  src_ip: (194.208.40.120)   dst_ip: 2    src_port: 2
  > [3]    dst_port: 8080  src_ip: (194.208.40.120)   dst_ip: 2    src_port: 2
  > [3]    dst_port: 3128  src_ip: (194.208.40.120)   dst_ip: 2    src_port: 2
  > [3]    dst_port: 1080  src_ip: (194.208.40.120)   dst_ip: 2    src_port: 2

Fields panel (Field, #, Value; values are cut off in the screenshot):
  id          100+   e5b80313...
  month              Apr
  program            snort
  rule id            732c5ed3...
  timestamp   100+   Fri 11-Apr-...
  _year              2003
  dst_ip      75     129.170.1...
  dst_port    14     1, 21, 80, ...
  flags              ******S*
  loghost            annon
  mark               pos
  program            snort
  repeat
Quick pair summary

[Screenshot: TreeView2, source: Kerf/data/ssh-auth-2users. Tabs: Fields, Slate, Features, Top, Index.]

[617] 617 logins from mediaone.net
  v [606] host: h000502032ae9.ne.mediaone.net   user: 7   tty: 5
      > [589] user: josh       tty: 4
      > [8]   user: jos        tty: ()
      > [3]   user:            tty: ()
      > [3]   user: johs       tty: ()
      > [1]   user: (null)     tty: ()
      > [1]   user: r]         tty: ()
      > [1]   user: josh^[[D   tty: ()
  > [10]  host: we-24-31-59-152.we.mediaone.net   user: (oleg)   tty: 2
  v [1]   host: h0010b565bb03.ne.mediaone.net     user: (josh)   tty: (ttyp0)
      v [1] user: josh   tty: (ttyp0)
          Jan 10 00:04:14 mystic syslog: LOGIN ON ttyp0 BY josh FROM h0010b...

Features panel (Feature, #, Entropy; values are cut off in the screenshot):
  H(%host|%user)   cH 0.169/1.18...
  H(%user|%host)   cH 0.000/1.00...

One ISP, 617 lines, 2 users, one tends to mistype.
11 lines of screen space.
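The Features panel above lists conditional entropies for the (host, user) pair in both directions. The sketch below shows how such a pair summary could be computed. The helper names (`H`, `cond_H`, `mutual_info`) and the toy login pairs are assumptions for illustration, and the output is not meant to reproduce the numbers in the screenshot.

```python
import math
from collections import Counter

def H(items):
    """Shannon entropy (in bits) of the empirical distribution of items."""
    counts = Counter(items)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def cond_H(pairs):
    """H(X | Y) for a list of (x, y) pairs, using H(X, Y) - H(Y)."""
    return H(pairs) - H([y for _, y in pairs])

def mutual_info(pairs):
    """I(X; Y) = H(X) - H(X | Y)."""
    return H([x for x, _ in pairs]) - cond_H(pairs)

# Hypothetical (user, host) login pairs: one user mistypes, one does not.
user_host = [("josh", "h1")] * 6 + [("jos", "h1")] * 2 + [("oleg", "h2")] * 2
host_user = [(h, u) for u, h in user_host]

print("H(user|host) =", round(cond_H(user_host), 3))
print("H(host|user) =", round(cond_H(host_user), 3))
print("I(user;host) =", round(mutual_info(user_host), 3))
```

A conditional entropy near zero in one direction says that the second field almost determines the first, so that pair compresses well into a small tree like the one in the screenshot.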
Novelty changes the order

[Screenshot: TreeView2, source: Kerf/data/snort2b.log. Tabs: Fields, Slate, Features, Top, Index.]

Snort portscan alerts
  > [17292] flags: ******S*   type: (SYN)      dst_port: 35     dst_ip: 95   src_ip: 100+
  > [7]     flags: ******SF   type: (SYNFIN)   dst_port: (21)   dst_ip: 7    src_ip: (142.26.217.6...

Features panel (values are cut off in the screenshot):
  %src_port   4876/17299/8.25...
  %src_ip     689/17299/6.169
  %dst_ip     95/17299/4.361/...
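A small sketch of why a handful of novel records can change the first split: a field with a single value has zero entropy and is skipped, but a few unusual values give it a small non-zero entropy, which makes it the "simplest histogram". The function names and the toy batches are assumptions for illustration; only the group sizes 17292 and 7 come from the screenshot.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def lowest_nonzero(columns):
    """columns maps field name -> list of values; returns the field
    with the smallest entropy that is still greater than zero."""
    scored = {f: entropy(v) for f, v in columns.items()}
    candidates = {f: h for f, h in scored.items() if h > 0}
    return min(candidates, key=candidates.get) if candidates else None

# Hypothetical batch where every alert is a plain SYN scan.
before = {"flags": ["SYN"] * 8, "dst_port": [445] * 6 + [80] * 2}
# Hypothetical batch with one unusual SYNFIN alert mixed in.
after = {"flags": ["SYN"] * 99 + ["SYNFIN"],
         "dst_port": [445] * 75 + [80] * 25}

print(lowest_nonzero(before))  # flags has zero entropy, so it is skipped
print(lowest_nonzero(after))   # flags now has a small non-zero entropy

# Entropy of a 17292 / 7 split, the group sizes shown in the screenshot.
h_flags = entropy(["SYN"] * 17292 + ["SYNFIN"] * 7)
print(h_flags)
```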
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Looking at Root-Fu captures 



[Screenshot; no text recoverable.]
Comparing 2nd order uncertainties

Compare uncertainties in each Protocol group:

1. Destination: H = 2.9999
2. Source: H = 2.8368
3. Info: H = 2.4957

"Start with the simpler view"

[Diagram: tree split on Protocol, with groups including DNS, Syslog, tcp, and MySQL.]
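One way to read the comparison above is as a size-weighted average of per-group entropies, with the lowest-scoring feature giving the "simpler view". The sketch below follows that reading. It is an assumption about how the slide's numbers were obtained, and the function names and toy packet records are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def per_group_entropy(records, group_field, feature):
    """Map each group value to (group size, entropy of feature in that group)."""
    groups = defaultdict(list)
    for r in records:
        groups[r[group_field]].append(r[feature])
    return {g: (len(v), entropy(v)) for g, v in groups.items()}

def rank_second_level(records, group_field, features):
    """Features sorted by size-weighted average entropy within the groups."""
    n = len(records)
    scores = []
    for f in features:
        stats = per_group_entropy(records, group_field, f)
        scores.append((f, sum(size * h for size, h in stats.values()) / n))
    return sorted(scores, key=lambda item: item[1])

# Hypothetical packet summaries.
packets = [
    {"proto": "DNS", "src": "a", "dst": "x", "info": "q"},
    {"proto": "DNS", "src": "b", "dst": "w", "info": "q"},
    {"proto": "TCP", "src": "a", "dst": "y", "info": "syn"},
    {"proto": "TCP", "src": "c", "dst": "z", "info": "ack"},
]
ranking = rank_second_level(packets, "proto", ["dst", "src", "info"])
for name, score in ranking:
    print(name, round(score, 4))
```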
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Looking at Root-Fu captures 



[Screenshots; no text recoverable.]
Screenshots (1)

[Screenshot: attribute selection dialog. Labels are cut off where shown with "...".]

Possible Attributes
  Protocol attributes: Ethernet, IP, LLC, NBNS, STP, UDP
  Packet list attributes:
    packetlist.number - Number
    packetlist.time - Time (format as...
    packetlist.time_relative - Relative...
    packetlist.time_absolute - Absolu...
    packetlist.abs_data_time - Absol...
    packetlist.time_delta - Delta time...
    packetlist.source_address - Sour...
    packetlist.resolved_src_addr - Sr...
    packetlist.unresolved_src_addr - ...
    packetlist.hardware_src_addr - Ha...
    packetlist.resolved_hw_src_addr...
    packetlist.unresolved_hw_src_ad...

Selected Attributes
  Protocol attributes
  Packet list attributes

Order Attributes
  packetlist.protocol

Buttons: Up, Down, Delete, Rename attributes (Name), New Attribute, Help, Save, Open, OK

Select Learning Algorithm
  Algorithm: Minimum Entropy Tree (JSV tree)
  Parameters: -max_tree_depth=10 -max_cluster_size=10 -lower_entropy_threshold=0.000000 -upper_entropy_threshold=0.800000
Screenshots (2)

[Screenshot: Ethereal packet list with the context menu open over rows 1 to 14. Visible column headers include Protocol and HwAddr.]

Packet list: UDP packets between 192.168.2.3 (HwAddr 00:11:50:38:81:70) and 195.138.145.122 (HwAddr 00:0d:60:76:d9:ce), with times from 0.000000 (packet 1) to 0.215444 (packet 14). Row 2 is hidden by the menu.

Context menu: Mark Packet (toggle), Time Reference, Find in Tree view, Apply as Filter, Prepare a Filter, Decode As..., Print..., Show Packet in New Window

Lower panel (tabs: Ranges, Template, Messages)
  Columns: Field name, Ethereal formula, Unique values, Entropy, Values summary
  Visible values: #undef 447; Retransmission 2
Screenshots (3)

[Screenshot: Template tab (tabs: Ranges, Template, Messages). Test expressions are cut off where shown with "...".]

Name                  Label                                         Test
root                  Min Entropy Tree
packetlist.protocol   "packetlist.protocol" %packetlist.protocol
eth.trailer           "eth.trailer" %eth.trailer                    [ %packetlist.protocol = "...
eth.dst               "eth.dst" %eth.dst                            [ %eth.trailer = "#undef" ]
leaf                  %line
leaf                  %line
tcp.ack               "tcp.ack" %tcp.ack                            [ %packetlist.protocol = "...

Selected label: "packetlist.protocol" %packetlist.protocol
Sort key: num_leaves
Sort order: desc
Button: Apply
Research links

Research on using entropy and related measures for network anomaly detection:

• Information-Theoretic Measures for Anomaly Detection, Wenke Lee & Dong Xiang, 2001
• Characterization of Network-Wide Anomalies in Traffic Flows, Anukool Lakhina, Mark Crovella & Christophe Diot, 2004
• Detecting Anomalies in Network Traffic Using Maximum Entropy Estimation, Yu Gu, Andrew McCallum & Don Towsley, 2005
• ...
Summary

Information theory provides useful heuristics for:

• summarizing log data in medium-size batches,
• choosing data views that show off interesting features of a particular batch,
• finding good starting points for analysis.

Helpful even with the simplest data organization tricks.

In one sentence:

H(X), H(X|Y), I(X;Y), ...: parts of a complete analysis kit!
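For reference, the three quantities named in the summary have the following standard definitions for discrete variables. The base-2 logarithm (entropy in bits) is an assumption here, since the slides do not state the base.

```latex
\begin{align*}
H(X) &= -\sum_{x} p(x)\,\log_2 p(x) \\
H(X \mid Y) &= \sum_{y} p(y)\,H(X \mid Y = y) \;=\; H(X, Y) - H(Y) \\
I(X; Y) &= H(X) - H(X \mid Y) \;=\; H(X) + H(Y) - H(X, Y)
\end{align*}
```

H(X | Y) = 0 means Y fully determines X, and I(X; Y) = 0 means the two fields are independent in the batch.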
Credits & source code

Credits

Kerf project: Javed Aslam, David Kotz, Daniela Rus, Ron Peterson
Coding: Cory Cornelius, Stefan Savev
Data & discussions: George Bakos, Greg Conti, Jason Spence, and many others.
Sponsors: see website

Code

For source code (GPL), documentation, and technical reports:
http://kerf.cs.dartmouth.edu

Thanks!